ci: run tests on GPU server #747
Codex Code Review
No findings. I reviewed the PR diff only: new GPU merge-queue workflow, GPU test script, Make targets, README additions, and the …
Verification was static only per instructions; I did not build or run tests.
Review: GPU test suite in merge queue
This is an infra/CI PR (workflow + Makefile targets + scripts + docs). I verified the Makefile targets reference real features and test files (…). No safety, correctness, or performance issues found. Two minor readability nits (posted inline):
Both are cosmetic; the workflow is sound. Note the PR's own caveat stands: as a required merge gate, a sustained Vast.ai outage (no offer after the retry loop) would block merges until it clears, an accepted tradeoff per the description.
/bench-growth
Benchmark: ethrex 20 transfers (median of 3)
Table parallelism: auto (cores / 3)
Memory Growth
ethrex distinct-account transfers · default parallelism · 1 sample per point
Growth rate: 2993 MB / transfer (main: 2952, Δ: +1.4%)
Commit: 1d08005 · Baseline: cached · Runner: self-hosted bench
AI Review
PR #747 · 5 changed files
Findings
Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).
AI-001: Hardcoded CUDARC_PIN/SYSROOT_DIR in the SSH remote string duplicate the script defaults
Claim: The REMOTE string passed over SSH hardcodes …
Evidence: …
Suggested fix: Drop the inline env from the SSH command so the script's defaults are the single source of truth, e.g. …
AI-002: Stale comment in scripts/gpu_test.sh still references the old `cuda_max_good>=12.8` offer floor
Claim: The explanatory comment about cudarc pinning says the pin "matches the cuda_max_good>=12.8 offer floor", but the new …
Evidence: The comment on line 42 of …
Suggested fix: Update the comment to either: (a) drop the offer-floor reference and just explain that the pin is a known-symbol set older than the newest cudarc; or (b) reference …
Reviewer Lanes
Verification Lanes
Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report. Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.
# Conflicts:
#	Makefile
…, sed guard, doc nits (#777)
- gpu-tests.yml: capture wait_ready's exit code in an else branch (rc=$? after 'if ...; fi' is always 0, so the timeout-vs-image-failure diagnostic never worked)
- gpu-tests.yml: add ServerAliveInterval/CountMax to the test ssh session so a box that goes dark mid-suite fails in ~10 min instead of eating the 240-min job budget
- gpu-tests.yml: upload the full test log as an artifact (the step log gets truncated in the UI for multi-hour runs, as the workflow itself notes)
- gpu_test.sh: guard the cudarc-pin sed anchors (a silent no-op after a math-cuda Cargo.toml refactor would resurrect the fallback-latest driver-symbol panic) and restore the mutated Cargo.toml on exit so manual runs don't leave the tree dirty
- Makefile: add test-prover-debug to .PHONY; correct the test-cuda-integration comment (the test asserts R1-R4 counters, not R1-R3); scope the comprehensive-cuda parity comment (CPU's merge-queue job also runs test_recursion_execute)
- docs/roadmap.md: GPU FFT / Merkle tree / FRI are implemented and CI-tested, not 'Planned'
- prove_elfs_tests.rs: fix 'cargo test --ignored' -> 'cargo test -- --ignored' in the ignore comment
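The first fix in this commit rests on a shell detail that is easy to miss: after `if cmd; then ...; fi`, `$?` is the status of the whole `if` statement, which is 0 when the condition fails and there is no `else` branch. The sketch below illustrates the pitfall and the else-branch fix. `wait_ready` here is a hypothetical stand-in that always returns 3; the real function and its exit codes are not shown on this page.

```sh
#!/bin/sh
# Hypothetical stand-in for the workflow's wait_ready; exit code 3 is an
# assumption chosen only to make the difference visible.
wait_ready() { return 3; }

# Broken pattern: the condition fails and there is no else branch, so the
# if statement itself exits 0 and the real exit code is lost.
if wait_ready; then
  echo "ready"
fi
broken_rc=$?

# Fixed pattern: inside the else branch, $? still holds the exit status of
# the condition command.
fixed_rc=0
if wait_ready; then
  echo "ready"
else
  fixed_rc=$?
fi

echo "broken_rc=$broken_rc fixed_rc=$fixed_rc"
```

Run as written, this prints `broken_rc=0 fixed_rc=3`, which is why a diagnostic that branches on the captured code only works with the second form.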
What
Adds a GPU test suite that runs in the merge queue and blocks the merge if it fails. It rents an RTX 5090 on Vast.ai, runs the CUDA tests there, and always destroys the box, reusing the rent → provision → run → always-destroy orchestration from `benchmark-gpu.yml`.
Why
CPU CI (`pr_main.yaml`) runs on GitHub `ubuntu-latest` runners, which have no GPU, so the CUDA backend is never exercised before merge. This suite runs the GPU-relevant tests on real hardware.
Test groups (run by `scripts/gpu_test.sh` via Makefile targets)
1. `test-math-cuda`
2. `test-cuda-integration`
3. `test-cuda-fallback`
4. `test-prover-cuda`: … `lambda-vm-prover/stark/crypto/ecsm` suite on the GPU path (CPU CI's sharded prover tests, run single-threaded with `--features cuda`)
5. `test-prover-comprehensive-cuda`: … (`test_prove_elfs_all_instructions_64_full`) on the GPU path
Groups 1–3 are GPU-only (no CPU equivalent); 4–5 mirror CPU CI's prover jobs but with the CUDA path enabled. All groups run even if one fails; the job fails if any group does.
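
The "run every group, fail if any failed" behaviour can be sketched as follows. This is a minimal illustration, not the contents of `scripts/gpu_test.sh`: `run_group` is a hypothetical helper, and `true`/`false` stand in for the real `make <target>` invocations so the sketch runs without a GPU.

```sh
#!/bin/sh
# Names of groups that failed, space-separated.
failed=""

# run_group NAME CMD...: run CMD, record NAME on failure, never abort early.
# (Hypothetical helper; the real script's structure is not shown here.)
run_group() {
  group=$1
  shift
  echo "=== $group ==="
  if "$@"; then
    echo "PASS: $group"
  else
    echo "FAIL: $group"
    failed="${failed:+$failed }$group"
  fi
}

# Stand-in commands; a real runner would call e.g. `make test-math-cuda`.
run_group test-math-cuda true
run_group test-cuda-integration false
run_group test-cuda-fallback true

# Aggregate: non-zero only after every group has had its turn.
if [ -n "$failed" ]; then
  echo "failed groups: $failed"
  overall=1
else
  overall=0
fi
echo "overall exit status would be: $overall"
```

A real runner would finish with `exit "$overall"`. Collecting the names rather than a bare flag keeps the final summary readable when the per-group logs are long.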
How it works
- … `merge_group` (one GPU rental per merge, not per push) + `workflow_dispatch`. A failure ejects the PR from the queue.
- … `scripts/gpu_test.sh`, which:
  - … (`CUDARC_PIN=cuda-12080`) and verifies the GPU toolchain (prints full `nvidia-smi`),
  - … (`compile-programs-asm` + `compile-programs-rust`),
- … (`nvidia-smi`, `nvcc --version`, CPU/RAM) before the test run, so the hardware is easy to find without scrolling the large test log.
- … (`--yes`, 3× retry, plus a label-based sweep so a cancelled-mid-rent box can't leak).
Provisioning: price bands + retry across offers
The provisioning is a single Provision instance (retry across offers) step (ideas borrowed from IDP's `create-vast-server.yml`), designed so a slow or flaky host can't stall the merge gate:
- … `PRICE_THRESHOLDS: "0.6 0.8 1.0"` ($/hr, ascending). It picks the priciest (most reliable) offer in the cheapest band that has capacity, only climbing to a pricier band when a lower one is empty. The last value ($1) is the hard cap.
- … `MAX_TRIES: 5` different physical hosts. Each host gets a `PROVISION_TIMEOUT: 600s` readiness budget (running + sshd accepting our key + onstart bootstrap finished); if it isn't ready in time, the box is destroyed and a different offer is tried.
- … `machine_id` is tracked and excluded from later scans, so a flaky host is never re-picked.
- … `status_msg` reports an image that can't be pulled / a host that can't meet the ask ("not started loading", "cannot be met", daemon errors) is abandoned immediately instead of burning the full 600s budget.
- … `OFFER_ATTEMPTS: 10` times (30s apart) before giving up, to ride out the small, fast-churning RTX 5090 pool.
- … `cuda_max_good>=13.1`, driver major ≥ 580.
CUDA version requirement
An NVIDIA driver supporting CUDA ≥ 13.1 is required: the kernels are compiled with the toolkit's `nvcc` into PTX that the driver JIT-compiles at load; an older driver rejects it with `CUDA_ERROR_UNSUPPORTED_PTX_VERSION`. Enforced by the `cuda_max_good>=13.1` offer filter and documented in the README "GPU Tests" section + `crypto/math-cuda/build.rs`.
Files
- `.github/workflows/gpu-tests.yml`: new workflow (job `gpu-tests`).
- `scripts/gpu_test.sh`: new on-box runner (cudarc pin + compile + 5 groups, aggregate exit).
- `Makefile`: adds `test-cuda-fallback`, `test-prover-cuda`, `test-prover-comprehensive-cuda` (groups 1 & 2 targets already existed).
- `README.md`: GPU Tests section + targets table.
- `crypto/math-cuda/build.rs`: comment clarifying the PTX/driver version coupling.
Before this gates merges (manual GitHub settings: admin, not code)
- `on: merge_group:` only makes the workflow eligible to run in the queue.
- … `gpu-tests` to the required status checks for `main` (Settings → Branches/Rulesets). Otherwise the job runs in the queue but a failure won't block the merge. Current required checks are only `Lint` and `Test`.
Sequencing: the `gpu-tests` check usually needs to have run once (via `workflow_dispatch`, or after this merges to `main`) before it appears in the required-checks picker. `merge_group` runs the workflow from the merge-commit ref (base + PR), same pattern as the existing merge-queue-only `test-prover-comprehensive` job.
Repo secrets `VAST_API_KEY` and `VAST_TEMPLATE_HASH` are already set.
Status
All 5 groups pass end-to-end on a rented RTX 5090. One real test-harness bug was fixed along the way: `test-cuda-fallback` ran its two tests in parallel, racing on the process-global GPU dispatch counters (`assert_eq!` count mismatches). It was fixed by adding `--test-threads=1` to that Makefile target.
Notes
A sustained Vast.ai outage (… `MAX_TRIES` offers across all price bands) would block merges until it clears; the price-band climb, per-host readiness budget + retry-across-offers, and re-scan-for-scarcity loops mitigate it.
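
The price-band climb mentioned above can be sketched in a few lines of shell and awk. This is an illustration under stated assumptions, not the workflow's actual code: the offer list, the `pick_offer` function, and the `tried` variable are hypothetical, and a band is read here as "price at or below the threshold", which gives the same pick as disjoint bands because a higher band is only consulted when every lower one is empty.

```sh
#!/bin/sh
# Hypothetical offers, one per line: "<machine_id> <price per hour>".
# Real offers would come from a Vast.ai search; these values are made up.
offers='101 0.95
102 0.72
103 0.78
104 0.65'

PRICE_THRESHOLDS="0.6 0.8 1.0"   # ascending; the last value is the hard cap
tried="103"                      # machine_ids already tried, to be skipped

# pick_offer prints "<machine_id> <price>" for the priciest untried offer in
# the cheapest band that has one, or nothing if every band is empty.
pick_offer() {
  printf '%s\n' "$offers" | awk -v bands="$PRICE_THRESHOLDS" -v tried="$tried" '
    BEGIN {
      n = split(bands, cap, " ")
      m = split(tried, t, " ")
      for (i = 1; i <= m; i++) skip[t[i]] = 1
    }
    # Keep only offers whose machine_id has not been tried yet.
    !($1 in skip) { id[++k] = $1; price[k] = $2 + 0 }
    END {
      for (b = 1; b <= n; b++) {
        best = 0
        for (i = 1; i <= k; i++)
          if (price[i] <= cap[b] + 0 && (best == 0 || price[i] > price[best]))
            best = i
        if (best) { print id[best], price[best]; exit }
      }
    }'
}

choice=$(pick_offer)
echo "picked: $choice"
```

With the sample data, nothing fits the 0.6 band, so the pick comes from the 0.8 band: machine 102 at 0.72, since 103 is excluded as already tried. With `tried` empty, the same call would pick 103 at 0.78. Offers above the last threshold are never selected, which is how the hard cap falls out of the same loop.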